Decision Tree using Python (sklearn):
Kumar Rahul
- Load the dataset in Jupyter Notebook using pandas
- Create a new data frame with the numeric features and with the categorical features coded as dummy variables. Which features will you include for model building and why?
- Split the data into a training set and a test set. Use 80% of the data for model training and 20% for model testing.
- Build the decision tree model and understand dummy variable coding when working with DT models
- Visualize the decision tree and interpret the decision tree business rules
- Validate the outcome of the model on the test set and report precision, recall and F-score on the test set
- Understand the concept of pipeline
Some advantages of decision trees are:
- Simple to understand and to interpret. Trees can be visualised.
- Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that the sklearn decision tree module does not support missing values.
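- Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. Note however that the sklearn implementation expects numeric input, which is why the categorical features are dummy coded in this notebook.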
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic.
The disadvantages of decision trees include:
- Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning (available in scikit-learn 0.22 and later through the ccp_alpha parameter), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
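The overfitting point above can be illustrated with a small sketch. It assumes synthetic data from make_classification (not the HR dataset), and the names full_tree and small_tree are hypothetical. It compares an unconstrained tree with one whose depth and leaf size are capped.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration; not the HR dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=4, flip_y=0.1,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42)

# Unconstrained tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Constrained tree: depth and leaf size are capped to limit complexity.
small_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=25,
                                    random_state=42).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("constrained", small_tree)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves(),
          "train acc: %.2f" % model.score(X_tr, y_tr),
          "test acc: %.2f" % model.score(X_te, y_te))
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting. The exact numbers depend on the data.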
Exhibit 1
| Sl.No. | Name of Variable | Variable Description |
|---|---|---|
| 1 | Candidate reference number | Unique number to identify the candidate |
| 2 | DOJ extended | Binary variable identifying whether candidate asked for date of joining extension (Yes/No) |
| 3 | Duration to accept the offer | Number of days taken by the candidate to accept the offer (continuous variable) |
| 4 | Notice period | Notice period to be served in the parting company before candidate can join this company (continuous variable) |
| 5 | Offered band | Band offered to the candidate based on experience and performance in interview rounds (categorical variable labelled C0/C1/C2/C3/C4/C5/C6) |
| 6 | Percentage hike (CTC) expected | Percentage hike expected by the candidate (continuous variable) |
| 7 | Percentage hike offered (CTC) | Percentage hike offered by the company (continuous variable) |
| 8 | Percent difference CTC | Percentage difference between offered and expected CTC (continuous variable) |
| 9 | Joining bonus | Binary variable indicating if joining bonus was given or not (Yes/No) |
| 10 | Gender | Gender of the candidate (Male/Female) |
| 11 | Candidate source | Source from which resume of the candidate was obtained (categorical variables with categories Employee referral/Agency/Direct) |
| 12 | REX (in years) | Relevant years of experience of the candidate for the position offered (continuous variable) |
| 13 | LOB | Line of business for which offer was rolled out (categorical variable) |
| 14 | DOB | Date of birth of the candidate |
| 15 | Joining location | Company location for which offer was rolled out for candidate to join (categorical variable) |
| 16 | Candidate relocation status | Binary variable indicating whether candidate has to relocate from one city to another city for joining (Yes/No) |
| 17 | HR status | Final joining status of candidate (Joined/Not-Joined) |
To check which Python executable the notebook kernel is using:
import sys, os
sys.executable
We are going to use the libraries below for data import, processing and visualization. As we progress, we will use other specific libraries for model building and evaluation.
import pandas as pd
import numpy as np
import seaborn as sn # visualization library based on matplotlib
import matplotlib.pyplot as plt
import graphviz
#the output of plotting commands is displayed inline within frontends like in Jupyter notebook
%matplotlib inline
Modify the ast_node_interactivity kernel option to see the value of multiple statements at once.
import os
os.getcwd()
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.options.display.max_columns = None
raw_df = pd.read_csv( "../HR_case/data/IMB533_HR_Data_No_Missing_Value.csv",
sep = ',', na_values = ['', ' '])
raw_df.columns = raw_df.columns.str.lower().str.replace(' ', '_')
raw_df.head()
Dropping SLNo and Candidate.Ref as these will not be used for any analysis or model building. To know about all the possible operations which can be performed on pandas dataframe:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
if set(['slno','candidate_ref']).issubset(raw_df.columns):
    raw_df.drop(['slno','candidate_ref'],axis=1, inplace=True)
raw_df.head()
raw_df.info()
raw_df.status.value_counts()
raw_df.describe(include='all').transpose()
To get help on the features of an object:
#?raw_df.status.value_counts()
Create a new data frame holding a copy of the raw data, so that the raw data stays intact for further manipulation if needed. The dropna() function deletes missing values: axis = 0 drops rows that contain a missing value, axis = 1 drops columns.
# how and thresh cannot both be passed in recent pandas releases, so only how is given
filter_df = raw_df.dropna(axis=0, how='any')
list(filter_df.columns )
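The difference between the two axis settings of dropna() can be seen on a toy frame. This is a minimal sketch: toy_df and its values are hypothetical and only borrow two column names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry.
toy_df = pd.DataFrame({"notice_period": [30, 45, np.nan],
                       "gender": ["Male", "Female", "Male"]})

rows_dropped = toy_df.dropna(axis=0, how="any")  # drops the third row
cols_dropped = toy_df.dropna(axis=1, how="any")  # drops the notice_period column

print(rows_dropped.shape, cols_dropped.shape)
print(toy_df.shape)  # the original frame is unchanged
```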
We will first start by printing the unique labels in categorical features
numerical_features = ['duration_to_accept_offer','notice_period','pecent_hike_expected_in_ctc',
'percent_hike_offered_in_ctc','percent_difference_ctc','rex_in_yrs','age']
categorical_features = ['doj_extended','offered_band','joining_bonus','candidate_relocate_actual',
'gender','candidate_source','lob','location','status']
for f in categorical_features:
    print("\nThe unique labels in {} is {}\n".format(f, filter_df[f].unique()))
    print("The values in {} is \n{}\n".format(f, filter_df[f].value_counts()))
Looking at the line of business feature, it seems that EAS, Healthcare and MMS do not have enough observations and may be clubbed together.
# Club the three sparse levels into a single 'Others' level
filter_df['lob'] = filter_df['lob'].replace(['EAS', 'Healthcare', 'MMS'], 'Others')
filter_df.lob.value_counts()
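The clubbing step can also be written so that sparse levels are found from their counts rather than listed by hand. The sketch below is one possible way to do it: the helper club_rare_levels, the threshold and the toy labels are all hypothetical.

```python
import pandas as pd

def club_rare_levels(series, min_count, other_label="Others"):
    """Replace levels seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare_levels = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with other_label.
    return series.where(~series.isin(rare_levels), other_label)

# Hypothetical line-of-business labels, not the real counts.
lob_demo = pd.Series(["ERS"] * 5 + ["BFSI"] * 4 + ["EAS", "MMS", "Healthcare"])
clubbed = club_rare_levels(lob_demo, min_count=3)
print(clubbed.value_counts())
```

The threshold is a judgment call. It trades detail in the feature against having enough observations per level.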
We will use groupby function of pandas to get deeper insights of the behaviour of people Joining or Not Joining the company. We will write a generic function to report the mean by any categorical variable.
def group_by (categorical_features):
    # numeric_only=True keeps the mean from failing on text columns
    return filter_df.groupby(categorical_features).mean(numeric_only=True)
group_by("doj_extended")
group_by("status")
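To see what the grouped mean returns, here is a minimal sketch on a toy frame. The frame groupby_demo_df and its values are hypothetical.

```python
import pandas as pd

groupby_demo_df = pd.DataFrame({
    "status": ["Joined", "Joined", "Not Joined", "Not Joined"],
    "notice_period": [30, 60, 90, 30],
    "gender": ["Male", "Female", "Male", "Female"],
})

# numeric_only=True skips text columns such as gender.
mean_by_status = groupby_demo_df.groupby("status").mean(numeric_only=True)
print(mean_by_status)
```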
Plotting can be done using the functions of the
- pandas library (http://pandas.pydata.org/pandas-docs/stable/visualization.html)
- matplotlib library (https://matplotlib.org/) or
- seaborn library (https://seaborn.pydata.org/) which is based on matplotlib and provides interface for drawing attractive statistical graphics.
Write a custom function to create a bar plot to visualize the average of numeric features w.r.t. each categorical feature. Say, the average number of days to accept the offer w.r.t. status as joined vs. not joined.
def bar_plot(xlabel,ylabel,xcnt,ycnt):
    sn.barplot(x = xlabel, y = ylabel, data= filter_df, ax = axes[xcnt,ycnt])
    fig.show()
numerical_features_set = ['duration_to_accept_offer','notice_period','age']
categorical_features_set = ['offered_band','gender','status']
xcnt=0
ycnt = 0
fig, axes = plt.subplots(3,3, figsize=(12,9))
fig.subplots_adjust(hspace = 1, wspace=.5)
for c in categorical_features_set:
    for n in numerical_features_set:
        bar_plot(c,n,xcnt,ycnt)
        ycnt = ycnt+1
    xcnt = xcnt+1
    ycnt=0
Remove the response variable, and the features that will not be used for modelling, from the list of features
X_features = list(filter_df.columns)
X_features.remove('status')
X_features.remove('pecent_hike_expected_in_ctc')
X_features.remove('percent_hike_offered_in_ctc')
X_features.remove('candidate_relocate_actual')
X_features
Useful reading on dummy variable coding when working with a decision tree classifier:
https://datascience.stackexchange.com/questions/47638/in-which-cases-shouldnt-we-drop-the-first-level-of-categorical-variables
https://towardsdatascience.com/understanding-decision-tree-classification-with-scikit-learn-2ddf272731bd
https://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest
encoded_X_df = pd.get_dummies(filter_df[X_features])
encoded_Y_df = pd.get_dummies(filter_df['status'])
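The effect of dummy variable coding is easier to see on a small frame. This sketch uses hypothetical rows with two of the dataset's column names. It contrasts the default coding, which keeps one column per level as done above, with drop_first=True, which drops the first level.

```python
import pandas as pd

source_demo_df = pd.DataFrame({
    "rex_in_yrs": [2, 7, 4],
    "candidate_source": ["Agency", "Direct", "Employee Referral"],
})

full_dummies = pd.get_dummies(source_demo_df)                      # one column per level
reduced_dummies = pd.get_dummies(source_demo_df, drop_first=True)  # first level dropped

print(list(full_dummies.columns))
print(list(reduced_dummies.columns))
```

Numeric columns pass through unchanged; only the text column is expanded. For tree models, keeping every level lets the tree split on any single category directly.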
A label encoder is an alternative way to code categories as integers. It should not be used for the features of a DecisionTreeClassifier, because the tree would treat the arbitrary integer codes as ordered values.
#from sklearn import preprocessing
#le = preprocessing.LabelEncoder()
#for i in range(0,filter_df.shape[1]):
#    if filter_df.dtypes[i]=='object':
#        filter_df[filter_df.columns[i]] = le.fit_transform(filter_df[filter_df.columns[i]])
Y = encoded_Y_df.filter(['Joined'], axis =1)
X = encoded_X_df
The train and test split can also be done using the sklearn module
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2, random_state = 42)
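Because the classes are imbalanced, a stratified split may be worth considering so that both parts keep the same class share. The sketch below shows the idea on hypothetical toy data. The stratify argument is an addition for illustration and is not used in the split above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 joined (1) and 20 not joined (0).
split_X = pd.DataFrame({"notice_period": range(100)})
split_y = pd.Series([1] * 80 + [0] * 20, name="Joined")

X_tr, X_te, y_tr, y_te = train_test_split(
    split_X, split_y, test_size=0.2, random_state=42, stratify=split_y)

print(len(X_tr), len(X_te))      # 80 training rows, 20 test rows
print(y_tr.mean(), y_te.mean())  # class share is preserved in both parts
```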
from sklearn import tree
# The help lookup only works once the module has been imported
tree.DecisionTreeClassifier?
dt = tree.DecisionTreeClassifier()
dt_model = dt.fit(X_train,y_train)
If you run into issues with graphviz, you may refer to the solutions here: https://stackoverflow.com/questions/28312534/graphvizs-executables-are-not-found-python-3-4
#!pip install six
#!conda install graphviz
#!pip install pydotplus
#from sklearn.externals.six import StringIO
from six import StringIO
from IPython.display import Image
from sklearn import tree
import pydotplus
output_file = StringIO()
# feature_names is passed as a plain list; some sklearn releases reject a pandas Index
vis_tree = tree.export_graphviz(dt_model,out_file=output_file,
                                feature_names=list(X_train.columns),
                                class_names=['Not Joined','Joined'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(output_file.getvalue())
graph.write_png("hr_decision_tree.png")
Image(graph.create_png())
Another way to visualize
import graphviz
vis_tree = tree.export_graphviz(dt_model,out_file=None,
                                feature_names=list(X_train.columns),
                                class_names=['Not Joined','Joined'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(vis_tree)
graph.render("hr_decision_tree")
#graph.view() # uncomment to open the rendered file in the default viewer
graph
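When graphviz is not available, the business rules can also be read as plain text. The sketch below uses sklearn.tree.export_text on a toy tree. The data, and the names rules_X, rules_y and rules_tree, are hypothetical and chosen only so that the rule is easy to read.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): long notice periods go with not joining.
rules_X = pd.DataFrame({"notice_period": [0, 30, 30, 45, 90, 120, 120, 90],
                        "joining_bonus_Yes": [0, 1, 0, 1, 0, 0, 1, 1]})
rules_y = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = Joined, 0 = Not Joined

rules_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
rules_tree.fit(rules_X, rules_y)

# export_text returns the fitted tree as indented if/else style rules.
rules_text = export_text(rules_tree, feature_names=list(rules_X.columns))
print(rules_text)
```

Each indented branch reads as a condition, and each leaf reports the predicted class, so a path from the root to a leaf is one business rule.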
The prediction can also be wrapped in a function. Below, a function is defined and then used for prediction.
def get_predictions ( test_class, model, test_data ):
    # Use the model passed in, not the global dt_model
    y_pred_df = pd.DataFrame( { 'actual_class': test_class,
                                'predicted_value': model.predict(test_data)})
    return y_pred_df
Label the Y column of the test set using a Python dictionary. This is done for the model built with dummy variable coding, and the labels are used later to generate the confusion matrix.
ser = y_test
status_dict = {1:"Joined", 0:"Not Joined"}
y_test_df = ser.replace(dict(Joined=status_dict))
y_test_df.rename({'Joined': 'actual'}, axis='columns', inplace=True )
y_test_df.head()
dt_model_df = pd.DataFrame(get_predictions(y_test_df.actual, dt_model, X_test))
dt_model_df.head()
dt_model_df['predicted_class'] = dt_model_df.predicted_value.map(lambda x: 'Joined' if x >= 1 else 'Not Joined')
dt_model_df.head()
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print("The dt model with dummy variable coding output: ")
confusion_matrix(dt_model_df.actual_class, dt_model_df.predicted_class)
dt_report = (classification_report(dt_model_df.actual_class, dt_model_df.predicted_class))
print(dt_report)
print("Precision for Joined Class is %.2f " %(1123/(1123+186)))
print("Precision for Not Joined Class is %.2f " %(154/(336+154)))
print("Macro Average Precision is %.3f" % np.divide((0.86 + 0.31),2))
print("Weighted Precision is %.2f" % (0.86*(1459/(1459+340))+0.31*(340/(1459+340))))
print("Recall for Joined Class is %.2f " %(1123/(1123+336)))
print("Recall for Not Joined Class is %.2f " %(154/(154+186)))
print("Macro Average Recall is %.2f" % np.divide((0.77 + 0.45),2))
print("Weighted Recall is %.2f" % (0.77*(1459/(1459+340))+0.45*(340/(1459+340))))
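The hand calculations above can be reproduced from the four confusion matrix counts instead of typing rounded values. This is a minimal sketch in plain Python: the counts are the ones hard-coded above, and the helper safe_ratio is hypothetical.

```python
# Confusion matrix counts taken from the hard-coded figures above.
joined_as_joined = 1123          # actual Joined, predicted Joined
joined_as_not_joined = 336       # actual Joined, predicted Not Joined
not_joined_as_joined = 186       # actual Not Joined, predicted Joined
not_joined_as_not_joined = 154   # actual Not Joined, predicted Not Joined

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

precision_joined = safe_ratio(joined_as_joined,
                              joined_as_joined + not_joined_as_joined)
recall_joined = safe_ratio(joined_as_joined,
                           joined_as_joined + joined_as_not_joined)
precision_not_joined = safe_ratio(not_joined_as_not_joined,
                                  not_joined_as_not_joined + joined_as_not_joined)
recall_not_joined = safe_ratio(not_joined_as_not_joined,
                               not_joined_as_not_joined + not_joined_as_joined)

# F-score is the harmonic mean of precision and recall.
f1_joined = safe_ratio(2 * precision_joined * recall_joined,
                       precision_joined + recall_joined)

# Macro average weighs classes equally; weighted average uses class support.
support_joined = joined_as_joined + joined_as_not_joined
support_not_joined = not_joined_as_not_joined + not_joined_as_joined
macro_recall = (recall_joined + recall_not_joined) / 2
weighted_recall = safe_ratio(
    recall_joined * support_joined + recall_not_joined * support_not_joined,
    support_joined + support_not_joined)

print("Joined: precision %.2f, recall %.2f, F1 %.2f"
      % (precision_joined, recall_joined, f1_joined))
print("Not Joined: precision %.2f, recall %.2f"
      % (precision_not_joined, recall_not_joined))
print("Macro recall %.2f, weighted recall %.2f" % (macro_recall, weighted_recall))
```

The weighted recall equals the overall accuracy, since weighting each class recall by its support adds up the correctly classified rows.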
def draw_cm( actual, predicted ):
    plt.figure(figsize=(9,9))
    cm = metrics.confusion_matrix( actual, predicted )
    sn.heatmap(cm, annot=True, fmt='.0f', xticklabels = ["Joined", "Not Joined"] ,
               yticklabels = ["Joined", "Not Joined"],cmap = 'Blues_r')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title('Classification Matrix Plot', size = 15);
    plt.show()
The classification matrix plot as reported by model 1 with dummy variable coding is:
draw_cm( dt_model_df.actual_class, dt_model_df.predicted_class )
def measure_performance (clasf_matrix):
    measure = pd.DataFrame({
        'sensitivity': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[0,1]),2)],
        'specificity': [round(clasf_matrix[1,1]/(clasf_matrix[1,0]+clasf_matrix[1,1]),2)],
        'recall': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[0,1]),2)],
        'precision': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[1,0]),2)],
        'overall_acc': [round((clasf_matrix[0,0]+clasf_matrix[1,1])/
                              (clasf_matrix[0,0]+clasf_matrix[0,1]+clasf_matrix[1,0]+clasf_matrix[1,1]),2)]
    })
    return measure
cm = metrics.confusion_matrix(dt_model_df.actual_class, dt_model_df.predicted_class)
dt_model_metrics_df = pd.DataFrame(measure_performance(cm))
dt_model_metrics_df
A pipeline chains a sequence of steps into a single estimator. It expects the steps as a list of (name, estimator) tuples, applies the transforms in order and then fits the final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn.pipeline import Pipeline
from tqdm.notebook import tqdm
seq_steps = [('dt', tree.DecisionTreeClassifier())]
pipeline = Pipeline(seq_steps)
print(pipeline)
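The pipeline above has a single step, so the role of intermediate transforms is not visible. The sketch below adds one transform before the tree. It assumes synthetic data, and the feature selection step (SelectKBest) is chosen only for illustration; it is not part of the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration; not the HR dataset).
pipe_X, pipe_y = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=42)

# Step 1 is a transform (fit + transform); step 2 is the final estimator.
demo_pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=4)),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=42)),
])
demo_pipeline.fit(pipe_X, pipe_y)
demo_predictions = demo_pipeline.predict(pipe_X[:5])
print(demo_predictions)

# Step parameters are addressed as <step name>__<parameter name>.
demo_pipeline.set_params(dt__max_depth=2)
print(demo_pipeline.named_steps["dt"].max_depth)
```

The double underscore naming is the same convention used for the keys of the parameter grid passed to a grid search.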
To report the performance on the selected KPI, use sklearn.metrics.get_scorer_names() to get the list of all the available metrics and pass the relevant one to RandomizedSearchCV or GridSearchCV.
Implement Grid Search to fine tune the model
criterion = ['gini','entropy'] #2
max_features = [None, 'log2','sqrt'] #3 ('auto' was removed for trees in scikit-learn 1.3)
max_depth = [2,3,4] #3
min_samples_split = [50,75,100,120] #4
min_samples_leaf = [50, 75] #2
class_weight = ['balanced',None] #2
# Create the grid
random_grid = {'dt__criterion': criterion,
'dt__max_features' : max_features,
'dt__max_depth' : max_depth,
'dt__min_samples_split': min_samples_split,
'dt__min_samples_leaf' : min_samples_leaf,
'dt__class_weight' : class_weight}
random_grid
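Before running an exhaustive search it helps to know how many candidate settings the grid expands to. The sketch below counts them with sklearn.model_selection.ParameterGrid. The dictionary demo_grid uses the same parameter names as the grid above, but its value lists are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid with the same parameter names as the notebook's grid.
demo_grid = {"dt__criterion": ["gini", "entropy"],
             "dt__max_features": [None, "log2", "sqrt"],
             "dt__max_depth": [2, 3, 4],
             "dt__min_samples_split": [50, 75, 100, 120],
             "dt__min_samples_leaf": [50, 75],
             "dt__class_weight": ["balanced", None]}

# The number of candidates is the product of the list lengths.
n_candidates = len(ParameterGrid(demo_grid))
print("candidate settings:", n_candidates)

# Each candidate is fitted once per fold.
for n_folds in range(3, 6):
    print("%d fold CV -> %d model fits" % (n_folds, n_candidates * n_folds))
```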
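# sklearn.metrics.SCORERS was removed in scikit-learn 1.3; get_scorer_names() lists the same names
from sklearn.metrics import get_scorer_names
get_scorer_names()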
# Use the parameter grid to search for the best hyperparameters
from sklearn.model_selection import GridSearchCV
#tree_model = tree.DecisionTreeClassifier(random_state=42)
# Exhaustive grid search of parameters, using 3, 4 and 5 fold cross validation.
# best_tree_model is overwritten on each pass, so the 5 fold search is the one kept.
for cv in tqdm(range(3,6)):
    best_tree_model = GridSearchCV(estimator = pipeline, param_grid = random_grid,
                                   scoring = "balanced_accuracy", cv = cv)
    # Fit the grid search model
    best_tree_model.fit(X_train, y_train.values.ravel())
    print("performance for %d fold CV = %2.2f" %(cv, best_tree_model.score(X_test,y_test)))
    print("best parameters for %d fold CV" %(cv))
    print(best_tree_model.best_params_)
best_tree_model.best_params_
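The fitted search object exposes more than best_params_. The sketch below runs a deliberately tiny grid search on synthetic data (an assumption, not the HR dataset) and reads back the best score and the refitted pipeline. The name demo_search is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (an assumption; not the HR dataset).
search_X, search_y = make_classification(n_samples=400, n_features=6,
                                         weights=[0.8, 0.2], random_state=42)

demo_search = GridSearchCV(
    estimator=Pipeline([("dt", DecisionTreeClassifier(random_state=42))]),
    param_grid={"dt__max_depth": [2, 3],
                "dt__class_weight": ["balanced", None]},
    scoring="balanced_accuracy",
    cv=3,
)
demo_search.fit(search_X, search_y)

print(demo_search.best_params_)          # best combination found
print("%.2f" % demo_search.best_score_)  # mean cross-validated balanced accuracy
# With the default refit=True, the best pipeline is refitted on all the data
# passed to fit(), so it can be used directly for prediction.
print(demo_search.best_estimator_.named_steps["dt"].get_depth())
```

The cross-validated best_score_ comes from the training data only, so it is a different number from the score on the held-out test set.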
best_tree_model_df = pd.DataFrame(get_predictions(y_test_df.actual, best_tree_model, X_test))
best_tree_model_df.head()
best_tree_model_df['predicted_class'] = best_tree_model_df.predicted_value.map(lambda x: 'Joined'
if x >= 1 else 'Not Joined')
best_tree_model_df[0:10]
draw_cm( best_tree_model_df.actual_class, best_tree_model_df.predicted_class )
cm = metrics.confusion_matrix(best_tree_model_df.actual_class, best_tree_model_df.predicted_class)
best_dt_model_metrics_df = pd.DataFrame(measure_performance(cm))
best_dt_model_metrics_df